1. INTRODUCTION

Ransomware is a type of malicious software that blocks users from accessing their device or personal
data and demands a ransom payment to restore access. Since its first appearance in the late 1980s,
ransomware has developed considerably, enabling attackers to move from blackmailing individuals to
large-scale corporate extortion. Detecting this type of attack is therefore a difficult technical problem [1].
The cost of ransomware damage was estimated at $5 billion for 2017 and was expected to reach $11.5 billion
in 2019 [2]. The Herjavec Group estimated that cybercrime will cost USD 6 trillion by 2021 [3]. In addition
to these major financial losses, the risk of falling victim to ransomware has risen by 97 percent since 2017 [4],
and the trend continues: it was reported that by the end of 2019 ransomware would strike a company every
14 s, dropping to every 11 s by 2021.

In the current paper, static analysis is used to detect ransomware attacks by extracting features of
32-bit size directly from binary files in the preprocessing stage. A gain ratio feature selection method is
used to select the features that best distinguish ransomware from goodware samples. In addition, three
supervised learning classification models are used, namely decision tree (J48), random forest (RF), and
radial basis function (RBF) network.

The classification models are trained using 50 percent of the collected ransomware and goodware
files, while the other 50 percent is used for testing the models. The results revealed that the random forest
classifier is the most accurate of the classification models, with an acceptable prediction time. The
remaining parts of the article are organized as follows: section 2 addresses the related work in
ransomware detection. Section 3 discusses the proposed approach and describes the preprocessing,
feature extraction, feature selection, and machine learning classifiers. Section 4 presents the dataset
collection. Section 5 describes the simulation performance and the experimental results, together with
the evaluation measures. Finally, section 6 comprises the concluding remarks.


2. RELATED WORKS 

Network security has attracted the attention of researchers since attacks on computer networks
became a major threat to different sectors, including individual users, corporations, and governmental
institutions [5]. One of the most dangerous attacks is ransomware, where the attacker encrypts and locks
the victim's files or systems and then demands a payment to unlock and decrypt them. Many researchers
have studied different techniques to detect ransomware attacks.

Kharaz et al. [6] introduced a dynamic analysis system named UNVEIL. This technique
monitors filesystem input/output (I/O) activity using the Windows filesystem mini-filter driver framework.
They revealed that the system can distinguish the behavior of ransomware, such as malicious
encryption of files. They also showed that the proposed technique could detect 13,637 ransomware samples
from various families with zero false positives. Sgandurra et al. [7] presented a detection technique named
EldeRan, which is based on dynamic analysis in a sandbox. This technique monitors a set of relevant
features in the first 30 seconds of ransomware execution. The mutual information criterion was used as a
feature selection method to select the most discriminating features. Furthermore, they utilized a regularized
logistic regression classifier for the classification process. Their result achieved an area under the curve of
around 0.995, but with a relatively high false positive ratio. Weckstén et al. [8] used file system activity,
registry manipulation, a software process monitor, and regshots to track the processing
activity in zeltzers. They found that crypto-ransomware attacks depend on the executable file
"vssadmin.exe".

Vinayakumar et al. [9] built a system that collects application programming interface (API)
sequences from a sandbox to implement dynamic analysis. Seven ransomware families were used in
the experiments. They employed a machine learning technique, the multilayer perceptron (MLP), for
the classification process. The outcomes achieved a detection accuracy of around 98%. Chen et al. [10]
designed a generative adversarial network (GAN) that can automatically extract dynamic features of
ransomware samples. They utilized these features in different classifiers, such as extreme gradient boosting
(XGB), linear discriminant analysis (LDA), random forest, naive Bayes, and support vector machine (SVM).
The results attained an accuracy of 99%. Takeuchi et al. [11] applied dynamic analysis to detect ransomware by
looking at the history of API calls run in a sandbox. They extracted API calls as features of ransomware and used
SVM to classify a dataset containing 312 goodware and 276 ransomware files. The experiments
showed an accuracy of approximately 97.48%.

Al-rimy et al. [12] established an ensemble-based detection model for crypto-ransomware. They
combined semi-random subspace selection (ESRS) with incremental bagging (Bagging). They
compared their results with many classifiers, including AdaBoost, RF, decision tree (DT), linear regression
(LR), k-nearest neighbors (KNN), and SVM. The results showed an accuracy of around 0.97 when 20 features
were used in the proposed system. Homayoun et al. [13] combined machine learning with
sequential pattern mining to find maximal sequential patterns (MSP). The dataset contains 220 goodware
samples and 1624 ransomware samples. The study comprised four classifiers, namely J48, random forest,
bagging, and MLP. Their findings achieved about 99% accuracy.

Alhawi et al. [14] suggested a machine learning analysis model called NetConverse. They extracted
features from the network traffic of ransomware samples and used six types of machine learning classifiers
from the Waikato Environment for Knowledge Analysis (WEKA) machine learning tool. They utilized 210
samples from 9 ransomware families and 264 goodware samples. They found that the decision tree (J48)
classifier could attain a true positive ratio (TPR) of around 97.1%. Baldwin and Dehghantanha [15] employed
static analysis. They used the SVM machine learning technique to classify five crypto-ransomware families
and goodware, with extracted opcode features used in the learning process. The outcomes showed
an accuracy of 96.5%. Zhang et al. [16] proposed an approach using static analysis for ransomware
classification. The technique extracts opcode sequences from ransomware samples, builds n-gram sequences
from them, and calculates the term frequency-inverse document frequency (TF-IDF) to generate
feature vectors. Then, five machine learning methods are used for classification. The proposed technique
achieved an accuracy of 91.43%. Some works use a hybrid system that combines static and
dynamic analyses, such as [17]-[19].

Shaukat and Ribeiro [17] built a system using a strong trap layer and machine learning. The experiment
analyzed the proposed system using 74 samples from 12 cryptographic ransomware families. The best result,
obtained with the gradient tree boosting algorithm, was a detection rate of around 98.25%. Meanwhile,
Subedi et al. [18] developed an analysis tool named crypt-ransomware-static (CRSTATIC), which creates
dynamic-link libraries (DLLs) from input binary programs. A data-mining technique was used to generate
association rules for these DLLs. Ferrante et al. [19] also built a hybrid system that contains a static
detection method and a dynamic detection method. The static approach utilized the frequency of opcodes,
while the dynamic detection method utilized system call statistics, memory usage, central processing unit
(CPU) usage, and network usage to detect Android ransomware. The false-positive rate was less than 4%.
The motivation of the current study is to analyze the ability of machine learning to detect ransomware using
features extracted directly from the binary file, where the most frequent features extracted from ransomware
files are combined with the most frequent features extracted from Snort malware signatures.


3. METHODOLOGY 

There is a need for a new technique that can be used in advanced security equipment to detect the
security threats of ransomware. This article investigates the ability of machine learning
techniques to detect ransomware by comparing three different classifiers using the proposed approach. The
proposed approach, shown in Figure 1, includes three major stages. The first stage is preprocessing of
the dataset, the second stage is feature selection, and the third stage uses three different classifiers
to detect ransomware.
Figure 1. The framework for ransomware attack detection


In the proposed method, the features are extracted directly from binary files using static analysis,
which eliminates the disassembly step needed to obtain opcode features. A preprocessing step then
prepares the dataset and creates the feature vectors. This step is essential because some of the
symbolic features included in the raw dataset prevent the classifier from processing the data. In the
preprocessing step, the symbolic features are eliminated or changed, as they do not contribute significantly
to attack detection. Moreover, these features have undesirable effects such as increased training time, wasted
computing resources and memory, and additional complexity in the classifier's architecture [20].

The preprocessing step involves several sub-processes. First, the raw bytes in each file are divided
using a fixed-size sliding window (32 bits) in order to extract the features, since dealing with bytes is more
straightforward and faster than using opcode features [21]-[23]. The feature size of 32 bits has been adopted in
the current study because it produces significant results in malware detection [24]-[27]. Second, the
frequency of each feature in these files is counted. According to Homayoun et al. [13],
there are common features in each ransomware family. Therefore, the current work focuses on selecting
these important features by analyzing each ransomware file using the counting process. The third sub-process
is a normalization step, which is necessary to create the feature vectors, as shown in (1).

$$N_t = \frac{n_{i,j}}{\sum_{h} n_{h,j}} \tag{1}$$

where $N_t$ is the normalized frequency, $\sum_{h} n_{h,j}$ is the total number of features in a file, and
$n_{i,j}$ is the frequency of a specific feature.
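
The windowing, counting, and normalization steps can be illustrated with a short Python sketch. It is not the implementation used in this study: the function names, the one-byte stride of the sliding window, and the sample bytes are assumptions made for illustration only.

```python
from collections import Counter


def ngram_frequencies(data: bytes, width: int = 4) -> Counter:
    """Count every 4-byte (32-bit) window of a file's raw bytes.

    Assumption: the window slides one byte at a time; the stride is not
    stated in the text.
    """
    return Counter(data[i:i + width] for i in range(len(data) - width + 1))


def normalize(counts: Counter) -> dict:
    """Apply (1): divide each feature count by the total count in the file."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: n / total for gram, n in counts.items()}


# Hypothetical 8-byte input, used only to exercise the functions.
sample = bytes.fromhex("4d5a90004d5a9000")
freqs = normalize(ngram_frequencies(sample))
```

The sample contains five windows, and the feature `4d5a9000` occurs in two of them, so its normalized frequency is 2/5.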

The second stage of the proposed method is the feature selection process, which is considered an
important part of the machine learning technique. It is generally used to improve the effectiveness of
data mining algorithms and the performance of data classification [28]. The major function of feature selection
is to minimize the dimensionality of the feature set by eliminating irrelevant features. In the current work,
the gain ratio (GR) feature selection method is employed, and the top 1000 features ranked by this method
are selected.
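
As an illustration of the gain ratio criterion, the following Python sketch ranks features by information gain divided by split information. It assumes that each feature has already been discretized (for example, present or absent in a file); the discretization actually applied in this study is not described here, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(values) -> float:
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(values).values())


def gain_ratio(feature, labels) -> float:
    """Information gain of `feature` about `labels`, divided by the
    entropy of the feature itself (the split information)."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [lab for f, lab in zip(feature, labels) if f == value]
        conditional += len(subset) / total * entropy(subset)
    split_info = entropy(feature)
    if split_info == 0:
        return 0.0  # a constant feature carries no information
    return (entropy(labels) - conditional) / split_info


def top_k(features: dict, labels, k: int) -> list:
    """Names of the k features with the highest gain ratio."""
    return sorted(features,
                  key=lambda name: gain_ratio(features[name], labels),
                  reverse=True)[:k]


# Toy data: 1 = ransomware, 0 = goodware; feature "a" separates the classes.
labels = [1, 1, 0, 0]
features = {"a": [1, 1, 0, 0], "b": [1, 0, 1, 0]}
ranked = top_k(features, labels, k=1)
```

In this toy case feature `a` has a gain ratio of 1 and feature `b` a gain ratio of 0, so only `a` is retained.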

The third stage of the proposed approach is the classification process. Three different classifiers
are examined in order to find the best classifier for the detection of ransomware. These classifiers are
decision tree (J48), random forest (RF), and radial basis function (RBF) network, which have been applied using
the WEKA tool (an open-source, graphical user interface (GUI) based machine learning tool). The decision tree,
developed by Quinlan [29], is an algorithm that creates a hierarchical set of rules based on minimizing
classification error. The random forest algorithm combines the results of many decision trees in order to
identify the optimal set of rules that minimizes the classification error. It iteratively selects random
subsamples of features to train multiple decision trees and then builds the classifier that makes predictions
in the testing phase [30]-[32].

The radial basis function (RBF) network is a supervised learning technique that minimizes the squared
error. It is a neural network with radially symmetric activation functions in the hidden layer, which means its
output depends on the distance between the input data vector and the weight vector, called the center [33]. A
fitness function is used to reach the best accuracy in the radial basis function network (RBFN). Many
fitness functions can be used to measure the error; the mean square error (MSE) is used in the current
research. The pseudocode of the proposed method for selecting the important features and the pseudocode
for the comparison of the machine learning models are given in Algorithm 1 and Algorithm 2, respectively.
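
The dependence of an RBF network's output on the distance to its centers, and the MSE used as the error measure, can be sketched in Python as follows. The Gaussian basis function, the shared width, and all names and values are assumptions chosen for illustration; they do not describe the WEKA configuration used in the experiments.

```python
import math


def rbf_activation(x, center, width):
    """Gaussian radial basis function (an assumed choice of basis):
    depends only on the Euclidean distance between x and the center."""
    dist_sq = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-dist_sq / (2 * width ** 2))


def rbf_network_output(x, centers, width, weights, bias=0.0):
    """Weighted sum of the hidden-layer activations."""
    hidden = [rbf_activation(x, c, width) for c in centers]
    return bias + sum(w * h for w, h in zip(weights, hidden))


def mean_squared_error(targets, predictions):
    """MSE between the target values and the network outputs."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


# Hypothetical two-center network evaluated at the first center.
centers = [(0.0, 0.0), (1.0, 1.0)]
output = rbf_network_output((0.0, 0.0), centers, width=1.0, weights=[1.0, -1.0])
```

At the first center the first hidden unit is fully active and the second is attenuated by its distance, so the output equals 1 - exp(-1).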


Algorithm 1 The pseudocode of the proposed method for selecting the important features.

T: total dataset files.
G_i: goodware files, G_i ⊂ T
R_i: ransomware files, R_i ⊂ T
S_m: Snort n-gram malware features.
h: total number of features.
F_I: total important features.
n_ij: frequency of a specific n-gram feature F_ij in T_j
F_r: ransomware n-gram features ⊂ R_i
F_g: goodware n-gram features ⊂ G_i
N_t: normalized term frequency of a specific feature.
F_I = F_r − (F_r ∩ F_g)
F_I = F_I + S_m
While (!EOF T_j) do
    For (each F_ij) do
        N_t = n_ij / Σ_h n_hj
    End for
End While
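
The set operations at the core of Algorithm 1 can be sketched in Python. This is an illustrative reading rather than the authors' code: the "+" in the algorithm is interpreted here as a set union, and the function name and sample n-grams are hypothetical.

```python
def important_features(ransomware_grams: set, goodware_grams: set,
                       snort_grams: set) -> set:
    """Keep n-grams seen in ransomware but not in goodware, then add the
    n-grams taken from Snort malware signatures (assumed to be a union)."""
    ransomware_only = ransomware_grams - (ransomware_grams & goodware_grams)
    return ransomware_only | snort_grams


# Hypothetical 4-byte n-grams, for illustration only.
f_r = {b"\x01\x02\x03\x04", b"MZ\x90\x00", b"\xde\xad\xbe\xef"}
f_g = {b"MZ\x90\x00"}
s_m = {b"\xca\xfe\xba\xbe"}
selected = important_features(f_r, f_g, s_m)
```

The n-gram shared with goodware is dropped, and the Snort-derived n-gram is added to the two ransomware-only n-grams.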


Algorithm 2 The pseudocode for comparison of machine learning models.

Procedure classifier ( )
    T: total dataset files.
    T_trn: training dataset, 50% of T
    T_tst: testing dataset, 50% of T, (T_trn ∩ T_tst = ∅)
    Input F_I
    F_a: top 1000 features selected using Gain Ratio, F_a ⊂ F_I
    Produce the classifier
    For each F_I
        Provide F_I to RF, J48, and RBF using T_trn
    Calculate
        A_RF = RF accuracy
        A_J48 = J48 accuracy
        A_RBF = RBF accuracy
    Compare the accuracy of A_RF, A_J48, and A_RBF
    Select the best classifier to classify T_tst
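
The comparison procedure of Algorithm 2 can be sketched in Python. The two trainers below are deliberately simple stand-ins, not RF, J48, or RBF, and the per-class (stratified) split, the function names, and the toy data are assumptions made so that the sketch stays self-contained.

```python
import random


def split_half(samples, seed=0):
    """Split each class into two disjoint halves (training, testing).
    Assumption: the 50/50 split is made per class."""
    rng = random.Random(seed)
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append((features, label))
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        mid = len(group) // 2
        train.extend(group[:mid])
        test.extend(group[mid:])
    return train, test


def accuracy(predict, test_set):
    """Fraction of test samples whose predicted label is correct."""
    return sum(1 for x, y in test_set if predict(x) == y) / len(test_set)


def compare(trainers, samples, seed=0):
    """Train every candidate on the training half, score it on the
    testing half, and return the best name with all scores."""
    train_set, test_set = split_half(samples, seed)
    scores = {name: accuracy(train(train_set), test_set)
              for name, train in trainers.items()}
    return max(scores, key=scores.get), scores


def train_majority(train_set):
    """Stand-in classifier: always predicts the most common training label."""
    labels = [y for _, y in train_set]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority


def train_threshold(train_set):
    """Stand-in classifier: thresholds the first feature at the midpoint
    between the two class means."""
    pos = [x[0] for x, y in train_set if y == 1]
    neg = [x[0] for x, y in train_set if y == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x[0] > cut else 0


# Toy data: label 1 has high feature values, label 0 has low ones.
samples = ([((0.8 + 0.01 * i,), 1) for i in range(10)]
           + [((0.1 + 0.01 * i,), 0) for i in range(10)])
best, scores = compare({"majority": train_majority,
                        "threshold": train_threshold}, samples)
```

In practice the stand-in trainers would be replaced by the three classifiers under study, each trained on the selected features.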


4. DATASET COLLECTION

Two types of executable files are used in the present study: ransomware executable files and goodware
executable files. The ransomware files were downloaded from VirusTotal [34], while the goodware files were
collected from the PortableApps platform [35] and the Windows platform. The total number of ransomware files is
840, drawn from three different ransomware families, Cerber, Locky, and TeslaCrypt, similar to [36]. The
collected goodware files have almost the same size as the ransomware files and the same number, 840 files.
VirusTotal.com was used to check the goodware and ransomware files. 50% of the dataset is used in the training
stage, while the remaining 50% is used in the testing stage, in order to avoid the problem of an imbalanced
dataset. In the present work, two operating systems have been used to implement the proposed method and obtain
the results. The first is Windows 10, running on a Core i7 CPU with 8 cores and 16 GB of RAM. The second
operating system is Linux 4.1.


5. EXPERIMENTAL RESULTS AND ANALYSIS 

One of the challenges facing researchers in detection systems is scalability, which involves
high storage requirements, longer implementation time, and complexity. To avoid these scalability effects,
different numbers of attributes are tested using GR to find the size that offers the highest accuracy at a
reasonable feature size. A size of 1000 attributes is found to be the best in terms of accuracy and time
consumption. Figure 2 shows the simulation of the training and testing stages for the classifiers used in the
proposed method.

In order to study the effectiveness of the classifiers, the false positive ratio (FPR), false negative ratio
(FNR), true negative ratio (TNR), true positive ratio (TPR), and accuracy are used in the current work [36],
as follows:





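$$\text{TPR or Recall} = \frac{TP}{TP+FN}, \quad \text{FPR} = \frac{FP}{FP+TN}, \quad \text{Precision} = \frac{TP}{TP+FP}$$

$$\text{TNR} = \frac{TN}{TN+FP}, \quad \text{Accuracy} = \frac{TP+TN}{TP+FP+TN+FN}, \quad \text{F-Measure} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision}+\text{Recall}}$$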


Where:

True positive (TP): the number of attack files that are correctly predicted as attack files.
True negative (TN): the number of goodware files that are correctly classified as goodware files.
False positive (FP): the number of goodware files that are incorrectly predicted as attack files.
False negative (FN): the number of attack files that are incorrectly predicted as goodware files.
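
The measures above can be computed from the four counts with a few lines of Python. The sketch also includes FNR = FN / (FN + TP), the standard complement of the TPR. The counts in the example are illustrative values for a test set of 420 ransomware and 420 goodware files; they are not figures taken from this study.

```python
def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Evaluation measures derived from the confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "TPR": recall,
        "FPR": fp / (fp + tn),
        "TNR": tn / (tn + fp),
        "FNR": fn / (fn + tp),  # standard definition, equal to 1 - TPR
        "precision": precision,
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "f_measure": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts, for illustration only.
m = detection_metrics(tp=419, fp=18, tn=402, fn=1)
```

With these counts the accuracy is 821/840, and the pairs TPR/FNR and TNR/FPR each sum to one.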
Figure 2. The simulation of the training and testing phase


To measure the detection accuracy of the different classifiers, all classifier parameters are left at
their default values in the experiments. The detection accuracy for different numbers of attributes
(from 1000 to 7000) is shown in Figure 3, which illustrates that the best accuracy (97.73%) is obtained
when using RF with 1000 attributes. Figure 4 shows the time needed by the different classifiers to predict the
testing dataset when the number of attributes ranges from 1000 to 7000. The results for fewer than 1000 and
more than 7000 attributes are not included in the current analysis because the detection accuracy in
these ranges is very low for the different classifiers. This is in line with [24], which mentioned that using a
large number of attributes reduces the accuracy of the classifier model. Figure 4 shows that the fastest
classifier in detection is J48 (0.54 s) across the different numbers of attributes, while RBF has the longest
prediction time (2.2 s), longer than that of RF. Although the RF prediction time (1.49 s) is not the lowest, its
highest accuracy makes it preferable to the other classifiers. Figures 5, 6, 7, and 8 show the trends of the
recall, precision, F-measure, and receiver operating characteristic (ROC), respectively, of the different
classifiers for different numbers of attributes.
Figure 3. The accuracy of different classifiers using different sizes of attributes

Figure 4. Time of different classifiers to predict the testing dataset using different number of attributes

Figure 5. The recall for different classifiers with a different number of attributes

Figure 6. The precision of different classifiers with a different number of attributes

It can be seen that the random forest achieved the best results for all of the above measures across the
different numbers of attributes, as follows: F-measure of 97.8, recall of 99.8, ROC of 99.6, and precision of
95.9. At the same time, the random forest results show that when the number of attributes increases, the values
of recall, precision, F-measure, and ROC decrease. This finding shows that the number of attributes has
a significant effect on the classifier accuracy, because irrelevant attributes or features in the data can
decrease the accuracy [24].

The FNR, FPR, and TNR are shown in Figures 9, 10, and 11, respectively. As is evident, the random
forest has the highest TNR (0.957), the lowest FPR (0.043), and the lowest FNR (0.002). To compare the
present work with previous research, Table 1 shows a comparison with the most closely related works. The
proposed method shows an advantage over the methods of [15] and [16].
Table 1. The comparison with other related works

Method                          Analysis type   Feature type      Classifier   Accuracy
Zhang et al. [16]               Static          Opcode (n-gram)   RF           91.4%
Baldwin and Dehghantanha [15]   Static          Opcode            SVM          96.5%
Present work                    Static          Binary (n-gram)   RF           97.7%


6. CONCLUSION 

The present work aimed to utilize machine learning techniques to detect ransomware attacks. The
importance of this paper lies in using features extracted directly from the raw bytes of the
executable file together with machine learning techniques. Three classification algorithms were utilized
in the current study: random forest, J48, and radial basis function network. It was found that random
forest is the most accurate in detecting ransomware using the proposed method. The most suitable size in the
feature selection process was found to be 1000 attributes.
The results showed that the random forest achieved the best results on all the measured parameters
across the different numbers of attributes, as follows: F-measure of 97.8, recall of 99.8, ROC of 99.6, and
precision of 95.9. At the same time, these results revealed that when the number of attributes increases, the
values of recall, precision, F-measure, and ROC decrease. This finding indicates that the number of attributes
has a significant effect on the classifier accuracy, because irrelevant attributes or features in the data can
decrease the accuracy. The advantage of the proposed method lies in the direct extraction of features
from binary files, without the need for opcode features, which take more time in the preprocessing stage
because of the disassembly process.